
Allow database dumps to be run against the read-only replica #2144


Merged
merged 4 commits into rust-lang:master from prod/fix-db-dump-fallback
Feb 5, 2020

Conversation

jtgeibel
Member

@jtgeibel jtgeibel commented Jan 20, 2020

Edit: The description has been updated, with the original message included below.

This PR updates the database dump script so that it can be run against the read-only replica database.

I've tested locally, but we can manually schedule a database dump against the read-only replica and then update the existing daily scheduled task.

Original PR Description

In production, the scheduler was configured to run:

```
./target/release/enqueue-job dump_db $READ_ONLY_REPLICA
```

instead of:

```
./target/release/enqueue-job dump_db $READ_ONLY_REPLICA_URL
```

Because of the typo in the variable name, the dump_db job has been
running against the primary database.  This patch modifies the fallback
to ensure the primary database is only used if explicitly provided.
@rust-highfive

r? @carols10cents

(rust_highfive has picked a reviewer for you, use r? to override)

@carols10cents
Member

So it turns out it's actually good that the scheduler was configured incorrectly; the dump_db task can't currently run on read-only databases because it creates views :( (see crates-io-incident-response for more info)

@sgrif
Contributor

sgrif commented Jan 20, 2020

We will need to rethink how this script works; it cannot currently be run against the replica (and changing the scheduler config to hit it caused the job to begin erroring). The issues we need to address are:

  • CREATE VIEW cannot be run in a read only transaction
  • SERIALIZABLE transactions are not permitted on a hot standby

This addresses the following error:

```
sql_error_code = 0A000 ERROR:  cannot use serializable mode in a hot standby
sql_error_code = 0A000 HINT:  You can use REPEATABLE READ instead.
sql_error_code = 0A000 STATEMENT:  BEGIN ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;

```

Because our application does not rely on any other `SERIALIZABLE`
transactions, it should be fine to relax the isolation level for
database dumps to `REPEATABLE READ`.

`DEFERRABLE` is dropped because it is only meaningful when combined
with `SERIALIZABLE`.
@jtgeibel jtgeibel changed the title Fallback to READ_ONLY_REPLICA_URL for db_dump Allow database dumps to be run against the read-only replica Jan 21, 2020
@jtgeibel
Member Author

Sorry again for the confusion on this stealth configuration change. I should have announced it in the appropriate channel.

I've pushed 2 additional commits to avoid creating temporary views and to switch to the "REPEATABLE READ" isolation level.

I've tested locally (against a normal database), but we can manually schedule a database dump against the read-only replica and then update the existing daily task if that is successful.

@sgrif
Contributor

sgrif commented Jan 21, 2020 via email

```diff
@@ -1,19 +1,7 @@
-BEGIN;
+BEGIN ISOLATION LEVEL REPEATABLE READ, READ ONLY;
```
Contributor

This provides somewhat weaker consistency guarantees for the snapshot used for the database dump. The Postgres documentation explicitly recommends using SERIALIZABLE READ ONLY DEFERRABLE for backups.

We don't seem to have a very strong case for moving the dump to the read-only replica, since it doesn't seem to have caused any problems on the main replica. Neither does it seem unreasonable to do so, even with the weaker consistency guarantees.

Contributor

One more question – are we sure REPEATABLE READ will work on the hot spare? It isn't obvious why SERIALIZABLE READ ONLY would not work, so we should test this on stage.

Member Author

are we sure REPEATABLE READ will work on the hot spare?

I definitely expect this to work. We can test by deploying the code, manually scheduling an export, verifying that everything works, and then finally updating the daily task.

It isn't obvious why SERIALIZABLE READ ONLY would not work

This definitely was a surprise to me as well. But I hope my other response helps explain why it isn't possible for the read-replica to be directly involved in cycle tracking on the leader. I've seen several suggestions that using DEFERRABLE from a replica could be supported in future releases by writing some additional data in the WAL so that the follower can find a valid snapshot to start from. It appears no such enhancement has been implemented yet.

we should test this on stage.

We can't actually test this on staging, because Heroku only supports read-only replicas on paid database plans. But we shouldn't have any issues merging and deploying this change as we can roll it out via a configuration change to the scheduler.

We don't seem to have a very strong case for moving the dump to the read-only replica, since it doesn't seem to have caused any problems on the main replica.

I'm not entirely sure about that. I don't think any of the temporary response time spikes can be traced to the export specifically, but I've definitely seen slight performance improvements and much less variation in response times after moving just a few expensive endpoints (#2073) to the read-only replica. In total, that PR seems to have moved only around 600 queries per hour off of the primary database, but the improvements seem impressive. It also appears to have eliminated the occasional spikes in download response times. I've collected some rough numbers and plan to follow up on that PR with more details, but I would definitely prefer to run the export from there as well.

Contributor

I'm perfectly fine with moving to the read replica, and with the additional explanations you gave I'm actually actively in favour of doing so.

```diff
@@ -13,7 +13,7 @@ fn main() -> Result<(), Error> {
     match &*job {
         "update_downloads" => Ok(tasks::update_downloads().enqueue(&conn)?),
         "dump_db" => {
-            let database_url = args.next().unwrap_or_else(|| env("DATABASE_URL"));
+            let database_url = args.next().unwrap_or_else(|| env("READ_ONLY_REPLICA_URL"));
```
Contributor

I assumed we are passing in the replica URL on the command line in production.

Could we add something like this to .env.sample, to make local testing easier?

```
export READ_ONLY_REPLICA_URL=$DATABASE_URL
```

(I'm not sure variable expansion works in that file.)

Member Author

I'm not sure variable expansion works in that file.

I think that does work, but I don't recall for certain. In this case, I think I would prefer to remove the fallback entirely and add an error message suggesting to pass $DATABASE_URL via the shell.
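
One way the suggested change could look, as a hypothetical sketch (the actual commit may differ; `database_url` is an invented helper name for illustration):

```rust
// Hypothetical sketch of the suggestion above: drop the implicit fallback
// and fail with a hint when no URL argument is provided, so the primary
// database is only used when passed explicitly.
fn database_url(arg: Option<String>) -> String {
    arg.unwrap_or_else(|| {
        panic!(
            "No database URL was passed to dump_db. Pass $DATABASE_URL \
             explicitly via the shell to run against the primary database."
        )
    })
}

fn main() {
    // Mirrors `args.next()` producing Some(url) when a URL is given.
    let url = database_url(Some("postgres://replica".to_string()));
    assert_eq!(url, "postgres://replica");
}
```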

Contributor

Sounds good.

@sgrif
Contributor

sgrif commented Jan 21, 2020 via email

@jtgeibel
Member Author

jtgeibel commented Jan 22, 2020

Long response incoming because this is subtle, I've done some recent reading, and I think it's worth documenting my current mental model.

Transaction Isolation

There are 3 isolation levels available in PostgreSQL:

  • READ COMMITTED - The default isolation level, and as far as I can tell only the db_dump logic uses a more strict level. The transaction has a "live" view of the database, and later queries can see the results of concurrent (committed) transactions that were not available to earlier queries (as it wasn't committed at that time).
  • REPEATABLE READ - The transaction takes a snapshot of the database upon first access, and concurrent writes by other transactions will not be seen by this transaction. If a transaction does some write operation, then a serialization error may be raised if our change conflicts with some other transaction (sort of like a git merge conflict). The application is expected to repeat the transaction, upon which it will start from a newer snapshot and hopefully does not conflict with newer concurrent writes (kind of like rebasing a PR).
  • SERIALIZABLE - This is the strictest level and ensures that (relative to other SERIALIZABLE transactions) it will be possible to write down some (i.e. at least one) sequence of transactions that explain the final state of the database. Transactions may still be run in parallel, but it appears as if the transactions had been done in series. Applications and their queries do not need to care about concurrency at all, but this comes at a performance cost and transactions must still be retried if rejected due to serialization errors.

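To make the snapshot behaviour concrete, here is a toy Rust model (not real Postgres; all names are invented for illustration) of the difference between a live READ COMMITTED view and a REPEATABLE READ snapshot:

```rust
// Toy model (not a real database) of REPEATABLE READ snapshot semantics.
#[derive(Clone, Copy)]
struct Db {
    value: i32,
}

struct RepeatableRead {
    snapshot: Option<Db>,
}

impl RepeatableRead {
    fn new() -> Self {
        RepeatableRead { snapshot: None }
    }

    // The first read captures a snapshot; later reads reuse it, so
    // concurrent committed writes stay invisible to this transaction.
    fn read(&mut self, live: &Db) -> i32 {
        self.snapshot.get_or_insert(*live).value
    }
}

fn main() {
    let mut live = Db { value: 0 };
    let mut txn = RepeatableRead::new();

    assert_eq!(txn.read(&live), 0); // snapshot taken on first access

    live.value = 42; // a concurrent transaction commits a write

    assert_eq!(live.value, 42); // a READ COMMITTED view sees the new value
    assert_eq!(txn.read(&live), 0); // the REPEATABLE READ snapshot does not
    println!("snapshot isolation holds");
}
```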
Let's look at the final level a bit more closely.

SERIALIZABLE

The key example I've seen goes roughly as follows. You have two transactions running in parallel, with a cyclic data dependency such as a = b + 1 and b = a + 1. Then, if we start with let (a, b) = (0, 0); there are 3 possible outcomes for non-SERIALIZABLE transactions:

  • If the first transaction runs first, and then the second, we end in the (1, 2) state. ((0, 0) -> (1, 0) -> (1, 2))
  • If the transactions run sequentially in the opposite order, we end in the (2, 1) state. ((0, 0) -> (0, 1) -> (2, 1))
  • If the transactions overlap concurrently, it is possible they both start with (0, 0) as the snapshot state and then each write a 1 into the LHS of the assignment. The final state is (1, 1) and there is no way to explain this as if the transactions had occurred in series. In a sense, the database abstraction is leaky and the effects of concurrency become externally observable.

The SERIALIZABLE isolation level is all about eliminating the last bullet point. The database tracks things at runtime and will break data cycles by aborting one of the transactions so it can be retried (with a newer snapshot). The database tries to minimize this runtime cost, but some workloads may trigger pathological behavior (ex: false positives where a transaction is aborted when it could have been committed).
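
The write-skew outcome in the last bullet can be replayed in a few lines of Rust (a toy illustration, not database code):

```rust
// Toy replay of the write-skew example: both transactions snapshot
// (a, b) = (0, 0), then each writes lhs = other + 1. The concurrent
// outcome (1, 1) matches no serial order, which is exactly what
// SERIALIZABLE would detect and abort.
fn main() {
    let (a0, b0) = (0, 0);

    // Both transactions read from the same (0, 0) snapshot and commit
    // without seeing each other's write.
    let (a, b) = (b0 + 1, a0 + 1);
    assert_eq!((a, b), (1, 1));

    // txn1 then txn2: (0,0) -> (1,0) -> (1,2)
    let serial_12 = { let a = b0 + 1; let b = a + 1; (a, b) };
    assert_eq!(serial_12, (1, 2));

    // txn2 then txn1: (0,0) -> (0,1) -> (2,1)
    let serial_21 = { let b = a0 + 1; let a = b + 1; (a, b) };
    assert_eq!(serial_21, (2, 1));

    println!("concurrent outcome (1, 1) matches no serial order");
}
```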

REPEATABLE READ is a pretty strong isolation level, but the lower level concurrency leaks into the application logic and it may be necessary to add explicit locking to queries. SERIALIZABLE does not expose the application to the results of concurrency, but it may not be as efficient as explicit locks.

In 2004 it was discovered that things can be even more subtle. It's possible to describe most concurrency edge-cases with 2 transactions, but this new anomaly requires 3 concurrent transactions to describe.

Serializable read-only anomaly

First, let's cover a bit of notation. We assume a single bank customer with two account balances (checking and savings). The database state is two atomic values represented as a tuple (checking, savings): (AtomicUsize, AtomicUsize). The Rust _ notation is used to show that there is no read/write operation to that value in a particular step. For example read (_, 10) means that 10 was read from savings and the value of checking was not read. write (5, _) means write a 5 to checking. The statement write (0, 0) is not valid, because it is not possible to modify multiple atomics simultaneously. Such a write would be split into 2 operations. Of course read (0, 0) is 2 sequential operations as well, this notation just reflects that the transaction doesn't care about the exact sequence.

The database starts in state (0, 0), and there are 3 concurrent transactions:

  • The withdraw transaction wants to subtract $10 from checking. If the total account balance checking + savings becomes negative, then an overdraft charge of $1 is also withdrawn.
  • The deposit transaction wants to add $100 to savings.
  • The check_balance read-only transaction which reads both values.

The withdraw and deposit transactions write to separate values, and do not form a write-cycle as demonstrated previously. On their own, it appears the two transactions need no synchronization. The only data dependency is between a write to savings by deposit and a read from savings by withdraw. If the write occurs first, then no fee is charged and the final state is (-10, 100). If the read from withdraw occurs first then a fee is charged and the final state is (-11, 100).

Surprisingly, introducing the read-only check_balance transaction into the mix can cause a serialization anomaly unless handled with care. The scenario is as follows:

  • withdraw transaction - read (0, 0) and decide that an overdraft fee must be charged.
  • deposit transaction - read (_, 0), write (_, 100).
  • check_balance transaction - read (0, 100) - the customer observes a state where the deposit appears to have occurred first.
  • withdraw transaction - write (-11, _).

At the end of the month, the bank runs check_balance again and sends the customer a statement showing a balance of (-11, 100). This is consistent with the series of events where withdraw occurred first and was later followed by deposit. Conveniently, this interpretation aligns with the physical sequence of events where the read did occur before the write to savings.

Less conveniently, when the customer checked their balance earlier they saw an intermediate state where the deposit had cleared and no withdrawal had yet occurred. From their perspective, the only possible serial sequence that explained that state at the time was that any future withdraw would see the new balance and would not charge a fee.
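
The four anomaly steps above can be replayed as a toy Rust sketch (illustrative only; the variable names and arithmetic are just the notation from this comment, not database code):

```rust
// Toy replay of the serializable read-only anomaly, using the
// (checking, savings) notation from the discussion above.
fn main() {
    let (checking, savings) = (0i32, 0i32);

    // withdraw reads (0, 0): the balance after withdrawing $10 would be
    // negative, so it decides to charge the $1 overdraft fee.
    let total_seen_by_withdraw = checking + savings;
    let fee = if total_seen_by_withdraw - 10 < 0 { 1 } else { 0 };
    assert_eq!(fee, 1);

    // deposit commits: write (_, 100).
    let savings = savings + 100;

    // check_balance snapshots here and observes (0, 100): to the
    // customer, the deposit appears to have happened first.
    assert_eq!((checking, savings), (0, 100));

    // withdraw finally commits: write (-11, _).
    let checking = checking - 10 - fee;

    // The final state says "withdraw before deposit", contradicting the
    // earlier check_balance observation -- the anomaly.
    assert_eq!((checking, savings), (-11, 100));
    println!("anomaly reproduced");
}
```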

To resolve this anomaly, if the first 2 bullet point steps have already occurred then the database has only two options:

  • Block the creation of the snapshot for check_balance until after withdraw is committed.
  • Reject either withdraw or check_balance and cause that client to restart from a newer snapshot.

This brings us to DEFERRABLE. It only applies if the transaction is already SERIALIZABLE, READ ONLY and indicates to the database that we are willing to block initially, waiting for concurrent writes to finish, before creating a snapshot. This is a great optimization for the database, as it no longer needs to include reads from this transaction in its runtime SERIALIZABLE tracking.

Conclusion

I'll follow up on a few other comments, but overall my assessment is that running the database export with REPEATABLE READ will still provide stronger guarantees than what we currently expose via our API (READ COMMITTED). REPEATABLE READ still provides a consistent snapshot of the database state for the export, it just doesn't participate in any of the more complicated SERIALIZABLE runtime checks.

If we later determine our API queries need stronger guarantees somewhere, we would probably prefer to add explicit locking to those queries. Even if we found a use case where we preferred to use SERIALIZABLE in our application, we could probably just document that the database exports "don't participate in complex concurrency situations" and "may expose internal intermediate states that appear to conflict with the eventual future state of the database".

@smarnach
Contributor

smarnach commented Feb 3, 2020

@jtgeibel Thanks for taking the time for a detailed answer! I agree with your conclusion – given that we don't have any other transactions with increased isolation levels, there currently shouldn't be an observable difference. My main concern when introducing the isolation level was to make sure that at least all foreign key relations still work, so it's possible to re-import the data into the database. With READ COMMITTED, not even that would be guaranteed, but REPEATABLE READ should be fine in that regard.

@jtgeibel
Member Author

jtgeibel commented Feb 5, 2020

@bors r=smarnach

@bors
Contributor

bors commented Feb 5, 2020

📌 Commit 4ae90e6 has been approved by smarnach

@bors
Contributor

bors commented Feb 5, 2020

⌛ Testing commit 4ae90e6 with merge 7072305...

bors added a commit that referenced this pull request Feb 5, 2020
Allow database dumps to be run against the read-only replica

@bors
Contributor

bors commented Feb 5, 2020

☀️ Test successful - checks-travis
Approved by: smarnach
Pushing 7072305 to master...

@bors bors merged commit 4ae90e6 into rust-lang:master Feb 5, 2020
@jtgeibel jtgeibel deleted the prod/fix-db-dump-fallback branch March 8, 2020 05:18